Introduction

Preliminary info

This course will give a basic introduction on (untargeted) preprocessing of LC-MS (also extendable to GC-MS or LC-MS/MS) based metabolomics. The methods and workflow presented in this course our mainly based on the workflow presented by the xcms package on Bioconductor.

For this course, you will need R version > 4.4.

If you haven’t already, the packages needed to complete today’s course can be installed by running the next code block:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("xcms")
BiocManager::install("MsExperiment")
BiocManager::install("Spectra")

And load the required packages:

Review of LC-MS data

First, let’s do a quick review of what kind of data we are dealing with. LC-MS data is generated by first separating chemical molecules based on certain properties (e.g. polarity). Afterwards the molecules are ionized and in the MS instrument, they are separated based on their mass-to-charge ratio or m/z and their intensity (abundance) is measured. Thus, molecules are separated in two different dimensions, the retention time dimension (from the LC) and the mass-to-charge dimension (from the MS) making it easier to measure and identify molecules in more complex samples.

Animation by Johannes Rainer & Phillipine Louail https://jorainer.github.io/xcmsTutorials/ The workflow presented by xcms (but also applicable in for example MZmine3 and most untargeted algorithms) is:

  • chromatographic peak detection: process in which chromatographic peaks are identified within a sample (file).
  • alignment: process that adjusts for retention time differences (i.e. possible signal drifts from the LC) between measurements/samples.
  • correspondence: grouping of chromatographic peaks (presumably from the same ion) across samples/files into LC-MS features

Goals of the workshop

  • Learn how to preprocess LC-MS data from open-source formats (such as .mzML) in R using xcms and other open-source packages
  • Evaluate and understand the results of the preprocessing workflow
  • Be able to build analysis workflows specifically tailored to your data

Example dataset

In this course we are using a subset of the data presented in Lloyd-Price et al. (2019). The subset is available to download via Metaboanalyst (https://new.metaboanalyst.ca/MetaboAnalyst/upload/SpectraUpload.xhtml) and contains 10 spectra (UPLC-Q/E-ESI-, C18) organized into three groups (Healthy, Crohn’s Disease and QC). The data is already available in the open-source .mzML format.

Importing LC-MS data

# Read the file names from the data folder
files <- list.files("data/",full.names = TRUE,pattern = ".mzML")

# Import the available metdata with the groups
metadata <- read.table("data/trimmed_metadata.txt",header = TRUE)
# Re-order metadata to match the files sequence
metadata <- metadata[order(metadata$Sample),]


#' Import the data of the experiment
ms_data <- readMsExperiment(files, sampleData = metadata)

Our object contains 4461 MS1 spectra in 10 samples

ms_data
## Object of class MsExperiment 
##  Spectra: MS1 (4461) 
##  Experiment data: 10 sample(s)
##  Sample data links:
##   - spectra: 10 sample(s) to 4461 element(s).

Various functions can be used to access specific data from the object:

sampleData(ms_data)
## DataFrame with 10 rows and 3 columns
##         Sample     Disease   spectraOrigin
##    <character> <character>     <character>
## 1     CD-6KUCT     Control C:\\Users\\p...
## 2     CD-77FXR     Control C:\\Users\\p...
## 3     CD-9OS5Y     Control C:\\Users\\p...
## 4     CD-9WOBP     Control C:\\Users\\p...
## 5     HC-9SNJ4     Disease C:\\Users\\p...
## 6     HC-9X47O     Disease C:\\Users\\p...
## 7     HC-AMR37     Disease C:\\Users\\p...
## 8     HC-AUP8B     Disease C:\\Users\\p...
## 9   QC_PREFA02          QC C:\\Users\\p...
## 10  QC_PREFB02          QC C:\\Users\\p...
spectra(ms_data)
## MSn data (Spectra) with 4461 spectra in a MsBackendMzR backend:
##        msLevel     rtime scanIndex
##      <integer> <numeric> <integer>
## 1            1   480.036         1
## 2            1   480.303         2
## 3            1   480.571         3
## 4            1   480.838         4
## 5            1   481.106         5
## ...        ...       ...       ...
## 4457         1   598.670       442
## 4458         1   598.997       443
## 4459         1   599.265       444
## 4460         1   599.532       445
## 4461         1   599.800       446
##  ... 33 more variables/columns.
## 
## file(s):
## CD-6KUCT.mzML
## CD-77FXR.mzML
## CD-9OS5Y.mzML
##  ... 7 more files

Centroiding data

MS instruments allow to export data in profile or centroid mode. Profile data contains the signal for all discrete m/z values (and retention times) for which the instrument collected data (see first figure below). MS instruments continuously sample and record signals, therefore a mass peak for a single ion in one spectrum will consist of multiple intensities at discrete m/z values. The process to reduce this distribution of signals to a single representative mass peak (the centroid, see second figure below) is called centroiding.

Profile data
Profile data
Centroid data
Centroid data

In this case, we need to inspect if the data we are analyzing is profile data (and thus still needs to be centroided for use with xcms) or not. Usually, centroiding is performed when converting the data from the proprietary data fromat to .mzML, but it can also be performed in R (see https://jorainer.github.io/xcmsTutorials/articles/xcms-preprocessing.html#centroiding-of-profile-ms-data).

We can easily check if the data is centroided using the function isCentroided. Checking in one file should be sufficient.

isCentroided(spectra(ms_data[1]))
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [406] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [436] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

The function returns TRUE for all spectra in the sample.

Visually, we can check some of the centroided data for a known compound in the data (see suppplementary info of the Lloyd-Price paper).

Method mz RT (min) Metabolite
C18n 538.3212 8.66 Met - Cholate
#Here we extract the m/z and RT range for our compound in the two QC samples using filterRT & filterMzRange, afterwards we plot.
ms_data[c(9,10)] |>
    filterSpectra(filterMzRange,538.3212 + c(-0.005, 0.005)) |>
    filterSpectra(filterRt,519.6 + c(-10,10)) |>
    plot()

Basic data inspection

We can inspect the total ion chromatogram using chromatogram. Setting aggregationFun = “sum” allows to calculate the total ion chromatogram (TIC), aggregationFun = “max” the base peak chromatogram (BPC).

#' Extract and plot a BPC
tic <- chromatogram(ms_data, aggregationFun = "sum")
plot(tic,main = "Total Ion Chromatogram")

We can also create boxplots representing the distribution of the total ion currents per data file. Such plots can be very useful to spot potentially problematic MS runs.

library(RColorBrewer)
## Generate a color set for the three groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:3], "60")
names(group_colors) <- c("Control", "Disease","QC")

## Get the total ion current by file
tc <- spectra(ms_data) |>
    tic() |>
    split(f = fromFile(ms_data))
boxplot(tc, col = group_colors[sampleData(ms_data)$Disease],
        ylab = "intensity", main = "Total ion current")

Data preprocessing

In this section, we will go through the 3 main steps of LC-MS preprocessing: peak picking, alignment and correspondence (and gap filling). Remember that you should tailor the parameters and settings to your specific dataset in you are refering to this document in the future.

Chromatographic peak detection

Chromatographic peak detection aims to identify peaks along the retention time axis that represent the signal from individual compounds’ ions. This involves identifying and quantifying such signals as shown in the sketch below.

Peak detection, sketch by J.Rainer & P. Louail
Peak detection, sketch by J.Rainer & P. Louail

Within xcms, several peak detection algorithms can be used and accessed by their respective parameter objects: MatchedFilterParam, CentWaveParam and MassifquantParam. However, in this workshop we will focus on the centWave algoritm (see the original publication (Tautenhahn, Böttcher, and Neumann 2008) for more details) as it is most frequently used in LC-MS metabolomics applications.

The different parameters available for the centWave algorithm can be called with CentWaveParam

CentWaveParam()
## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 25
##  - peakwidth: [1] 20 50
##  - snthresh: [1] 10
##  - prefilter: [1]   3 100
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 0
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE
##  - verboseBetaColumns: [1] FALSE

It is strongly discouraged to use these default parameter settings for any of your preprocessing! There are tools available that automatically configure these parameters based on your data (such as IPO and autoTuner), but generally, the best results are obtained by tuning the parameters manually to your data and ions of interest.

Retention time alignment

Peak alignment, sketch by J.Rainer & P. Louail
Peak alignment, sketch by J.Rainer & P. Louail

Correspondence

Correspondence, sketch by J.Rainer & P. Louail
Correspondence, sketch by J.Rainer & P. Louail

Gap filling

Data evaluation

Conclusion and further analysis